Posting Paper on the Web
Abstract
We present a document processing system that accepts scanned images of paper documents as input and outputs hyperlinked electronic documents. The system segments document images, separating text from graphics, recognizes text, and creates hypertext links between document components (text, images, graphics). By (1) limiting input to the popular Times-Roman and Helvetica fonts found in first-generation scans of columnated magazines and tabloids, (2) using gray-scale attributes, (3) using multiple character prototypes to recognize kerned and touching characters, (4) using a lexicon to find and correct recognition errors, and (5) providing user interaction to recognize problem words, we achieve OCR accuracies of up to 99.8%. This is close to our measured human proofreading accuracy (99.94%), although human proofreading takes six times longer. We also develop a simple method that automatically selects important words in a document and creates hypertext links from those words to other document components, providing high-level searching and browsing of the document. Browsing granularity (i.e., the number and types of linked words) is user-selectable. The appropriateness of the automatically selected link anchors is comparable to human performance.
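The abstract does not say how the lexicon in step (4) is applied, so the following is only a minimal sketch of one plausible approach: accept a recognized word if it is in the lexicon, otherwise substitute the closest lexicon entry, and otherwise flag it for the user as in step (5). The function name `correct_word`, the sample lexicon, and the similarity cutoff are illustrative assumptions, not details from the paper.

```python
from difflib import get_close_matches

# Hypothetical toy lexicon; a real system would load a full word list.
LEXICON = ["document", "image", "hypertext", "scanned", "graphics"]


def correct_word(word, lexicon=LEXICON, cutoff=0.75):
    """Return a lexicon-checked version of an OCR output word.

    - If the word is already in the lexicon, return it unchanged.
    - Otherwise return the most similar lexicon entry, if any entry
      reaches the similarity cutoff (an assumed threshold).
    - Otherwise return None, meaning "ask the user" (step 5).
    """
    lowered = word.lower()
    if lowered in lexicon:
        return word
    matches = get_close_matches(lowered, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    # "rn" misread for "m" is a typical touching-character confusion.
    for token in ["docurnent", "image", "zzzz"]:
        print(token, "->", correct_word(token))
```

Returning `None` rather than guessing keeps low-confidence words visible to the user, which mirrors the interactive step the abstract describes.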
Related resources
The ethics of DeCSS posting: towards assessing the morality of the Internet posting of DVD copyright circumvention software
Introduction. We investigate the conditions under which posting software known as "DeCSS" on the Internet is ethical. DeCSS circumvents the access and copy control protection measures on commercial DVDs. Through our investigation, we point to limitations in current frameworks used to assess ethical computer based civil disobedience. Method. The paper draws on empirical findings of actual DeCSS ...
On Inverted Index Compression for Search Engine Efficiency
Efficient access to the inverted index data structure is a key aspect for a search engine to achieve fast response times to users’ queries. While the performance of an information retrieval (IR) system can be enhanced through the compression of its posting lists, there is little recent work in the literature that thoroughly compares and analyses the performance of modern integer compression sch...
A Selfie is Worth a Thousand Words: Mining Personal Patterns behind User Selfie-posting Behaviours
Selfies have become increasingly fashionable in the social media era. People are willing to share their selfies on various social media platforms such as Facebook, Instagram and Flickr. The popularity of selfies has caught researchers’ attention, especially that of psychologists. In computer vision and machine learning areas, little attention has been paid to this phenomenon as a valuable data source....
The More You Know: Information Effects on Job Application Rates by Gender in a Large Field Experiment
This paper presents the results from a 2.3 million person field experiment that varies whether a job seeker is shown the number of applicants for a job posting on a large job posting website, LinkedIn. This intervention increases the likelihood a person will start/finish an application by 0.6%-1.9%, representing an economically significant potential increase of over a thousand applications per ...
Five Years of Experience with a World-Wide-Web-based Job Directory for Neonatal-Perinatal Health Care
This poster describes the implementation and five-year utilization history for a free Web-based jobs directory used by health care professionals in neonatal-perinatal medicine. The World-Wide-Web site "Neonatology on the Web" (NOTW) has been in continuous operation since Fall of 1995. NOTW is dedicated to the information needs of professionals in neonatal-perinatal medicine and is the most heavi...
Development of Real Time Synchronous Web Application for Posting and Utilizing Disaster Information
In a large earthquake, rescue operations and fire-fighting are obstructed by spreading fires and street blockages. Therefore, it is important to quickly collect and utilize disaster information for disaster mitigation. In this paper, firstly, we develop a Web application for posting and viewing information collected by users in real time. Using this system, it is possible not only to easily share...